Syn-QG: Syntactic and Shallow Semantic Rules for Question Generation
Question Generation (QG) is fundamentally a simple syntactic transformation;
however, many aspects of semantics influence what questions are good to form.
We implement this observation by developing Syn-QG, a set of transparent
syntactic rules leveraging universal dependencies, shallow semantic parsing,
lexical resources, and custom rules which transform declarative sentences into
question-answer pairs. We utilize PropBank argument descriptions and VerbNet
state predicates to incorporate shallow semantic content, which helps generate
questions of a descriptive nature and produce inferential and semantically
richer questions than existing systems. In order to improve syntactic fluency
and eliminate grammatically incorrect questions, we employ back-translation
over the output of these syntactic rules. A set of crowd-sourced evaluations
shows that our system can generate a larger number of highly grammatical and
relevant questions than previous QG systems and that back-translation
drastically improves grammaticality at a slight cost of generating irrelevant
questions.
Comment: Some of the results in the paper were incorrect.
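As a toy illustration of what a single syntactic transformation rule can look like (a deliberately simplified string-level sketch, not the universal-dependency-based rules Syn-QG actually uses), subject-auxiliary inversion turns a declarative sentence into a yes/no question, and replacing the subject span yields a wh-question:

```python
# Toy sketch of one syntactic QG rule. This is NOT the Syn-QG rule set:
# the real system parses universal dependencies and uses PropBank/VerbNet;
# here we only look for a surface auxiliary verb.
AUXILIARIES = {"is", "are", "was", "were", "can", "will", "has", "have"}

def simple_questions(sentence):
    words = sentence.rstrip(".").split()
    # skip index 0: a sentence-initial auxiliary is already a question
    for i, word in enumerate(words[1:], start=1):
        if word.lower() in AUXILIARIES:
            subject = " ".join(words[:i])
            rest = " ".join(words[i + 1:])
            yes_no = f"{word.capitalize()} {subject} {rest}?"   # inversion
            wh = f"Who {word.lower()} {rest}?"                  # subject wh-question
            return yes_no, wh
    return None  # no auxiliary found; a real system has many more rules

print(simple_questions("Marie Curie was a pioneering physicist"))
```

A real rule set must also handle tense, do-support, and argument structure, which is where the shallow semantic resources come in.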
CANDLE: Decomposing Conditional and Conjunctive Queries for Task-Oriented Dialogue Systems
Domain-specific dialogue systems generally determine user intents by relying
on sentence-level classifiers which mainly focus on single action sentences.
Such classifiers are not designed to effectively handle complex queries
composed of conditional and sequential clauses that represent multiple actions.
We attempt to decompose such queries into smaller single-action sub-queries
that are reasonable for intent classifiers to understand in a dialogue
pipeline. We release CANDLE (Conditional & AND type Expressions), a dataset
consisting of 3124 utterances manually tagged with conditional and sequential
labels and demonstrate this decomposition by training two baseline taggers.
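A minimal sketch of the decomposition idea (not the CANDLE baseline taggers, which are trained models, and using an illustrative connective list): split a complex utterance on conditional and sequential connectives so that each piece is a candidate single-action sub-query for a downstream intent classifier.

```python
import re

# Illustrative connective inventory; "and then" must precede "then" and
# "and" in the alternation so the longest connective matches first.
CONNECTIVES = re.compile(r"\b(if|when|and then|then|and)\b", re.IGNORECASE)

def decompose(query):
    parts = CONNECTIVES.split(query)
    # re.split keeps the captured connectives at odd indices; drop them
    # and trim punctuation from the remaining clause fragments.
    return [p.strip(" ,.") for p in parts[::2] if p.strip(" ,.")]

print(decompose("If the store is open then order groceries and then text me"))
# -> ['the store is open', 'order groceries', 'text me']
```

A rule this crude over-splits real utterances (e.g. conjoined noun phrases), which is why the paper trains taggers on annotated data instead.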
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
Machine learning approaches applied to NLP are often evaluated by summarizing
their performance in a single number, for example accuracy. Since most test
sets are constructed as an i.i.d. sample from the overall data, this approach
overly simplifies the complexity of language and encourages overfitting to the
head of the data distribution. As such, rare language phenomena or text about
underrepresented groups are not equally included in the evaluation. To
encourage more in-depth model analyses, researchers have proposed the use of
multiple test sets, also called challenge sets, that assess specific
capabilities of a model. In this paper, we develop a framework based on this
idea which is able to generate controlled perturbations and identify subsets in
text-to-scalar, text-to-text, or data-to-text settings. By applying this
framework to the GEM generation benchmark, we propose an evaluation suite made
of 80 challenge sets, demonstrate the kinds of analyses that it enables, and
shed light on the limits of current generation models.
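The two operations such a framework automates can be sketched as follows (hypothetical helper names, not the actual API of the proposed suite): a controlled perturbation that rewrites every input, and a subset selector that isolates examples with a chosen property.

```python
# Hedged sketch of the challenge-set idea. Examples are plain dicts;
# "..." targets are placeholders.
def perturb_lowercase(example):
    # Controlled perturbation: strip casing cues from the input text,
    # probing whether a model relies on capitalization.
    return {**example, "input": example["input"].lower()}

def subset_long_inputs(test_set, min_tokens=8):
    # Subset selection: keep only long inputs, a slice of the data
    # where a single aggregate score often hides weaker performance.
    return [ex for ex in test_set if len(ex["input"].split()) >= min_tokens]

test_set = [
    {"input": "Barack Obama was born in Hawaii in 1961.", "target": "..."},
    {"input": "Cats sleep a lot.", "target": "..."},
]
challenge_set = [perturb_lowercase(ex) for ex in test_set]
print(challenge_set[0]["input"])          # barack obama was born in hawaii in 1961.
print(len(subset_long_inputs(test_set)))  # 1
```

Evaluating a model separately on each derived set, rather than on the i.i.d. sample alone, is what surfaces the capability gaps the abstract describes.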
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd enables the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken.
A Bird's-Eye Tutorial of Graph Attention Architectures
Graph Neural Networks (GNNs) have shown tremendous strides in performance for
graph-structured problems especially in the domains of natural language
processing, computer vision and recommender systems. Inspired by the success of
the transformer architecture, there has been an ever-growing body of work on
attention variants of GNNs attempting to advance the state of the art in many
of these problems. Incorporating "attention" into graph mining has been viewed
as a way to overcome the noisiness, heterogeneity and complexity associated with
graph-structured data as well as to encode soft inductive biases. It is hence
crucial and advantageous to study these variants from a bird's-eye view to
assess their strengths and weaknesses. We provide a systematic and focused
tutorial centered around attention-based GNNs in the hope of benefiting researchers
dealing with graph-structured problems. Our tutorial looks at GNN variants from
the point of view of the attention function and iteratively builds the reader's
understanding of different graph attention variants.
Comment: 8 pages. Tutorial
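To make the "point of view of the attention function" concrete, here is a compact single-head sketch in the style of the original GAT scoring rule (a simplified illustration under assumed shapes, not any particular library's implementation); most variants the tutorial surveys differ mainly in how the edge score e_ij is computed.

```python
import numpy as np

# e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax-normalized over each
# node's neighborhood. Assumes the adjacency matrix A includes self-loops
# so every row has at least one edge to attend to.
def gat_layer(H, A, W, a, slope=0.2):
    Z = H @ W                                     # shared linear projection
    n = Z.shape[0]
    e = np.full((n, n), -np.inf)                  # -inf masks non-edges
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else slope * s   # LeakyReLU
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)         # softmax over neighbors
    return att @ Z                                # attention-weighted aggregation

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                       # 4 nodes, 3 input features
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])                      # path graph + self-loops
out = gat_layer(H, A, W=rng.normal(size=(3, 2)), a=rng.normal(size=4))
print(out.shape)  # (4, 2)
```

Transformer-style variants replace the concatenation score with scaled dot-product attention and add multiple heads, but the mask-then-normalize skeleton stays the same.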
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
We introduce GEM, a living benchmark for natural language Generation (NLG),
its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly
evolving ecosystem of automated metrics, datasets, and human evaluation
standards. Due to this moving target, new models often still evaluate on
divergent anglo-centric corpora with well-established, but flawed, metrics.
This disconnect makes it challenging to identify the limitations of current
models and opportunities for progress. Addressing this limitation, GEM provides
an environment in which models can easily be applied to a wide set of tasks and
in which evaluation strategies can be tested. Regular updates to the benchmark
will help NLG research become more multilingual and evolve the challenge
alongside models. This paper serves as the description of the data for which we
are organizing a shared task at our ACL 2021 Workshop and to which we invite
the entire NLG community to participate.
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.